Indexing huge genome sequences for solving various problems.
Abstract
As genome sequence databases grow, indexing the sequences for fast queries becomes increasingly important. Suffix trees and suffix arrays are used for simple queries. However, they are not suitable for complicated queries over huge amounts of sequence data, because the indices must be stored on disk, which has slow access speed. We propose storing the indices in memory in a compressed form, using the compressed suffix array. It stores the suffix array compactly at the cost of a theoretically small slowdown in access speed. We show experimentally that the overhead of using the compressed suffix array is reasonable in practice. We also propose an approximate string matching algorithm suited to the compressed suffix array. Furthermore, we have constructed the compressed suffix array of the whole human genome. Because its size is about 2 Gbytes, a workstation can hold the search index for the whole data set in main memory, which will speed up the solution of various problems in genome informatics.
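To make the trade-off in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It builds a plain suffix array, answers an exact-match query by binary search, and then recovers suffix array entries from the successor function Ψ plus a sparse sample of the array, which is the usual idea behind compressed suffix arrays. All function names and the sampling step are assumptions made for illustration. The compact encoding of Ψ, which is where the actual space saving comes from, is omitted.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; only suitable for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    # Binary search for the block of suffixes that start with `pattern`.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def build_psi(sa):
    # psi[r] is the rank of the suffix that starts one position after
    # the suffix of rank r (wrapping around at the end of the text).
    n = len(sa)
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    return [inverse[(pos + 1) % n] for pos in sa]


def sample_suffix_array(sa, step):
    # Keep only entries whose text position is a multiple of `step`.
    return {rank: pos for rank, pos in enumerate(sa) if pos % step == 0}


def lookup(rank, psi, samples, n):
    # Recover sa[rank] without the full array: follow psi until a
    # sampled rank is reached, then subtract the number of steps taken.
    steps = 0
    while rank not in samples:
        rank = psi[rank]
        steps += 1
    return (samples[rank] - steps) % n


text = "ACGTACGT$"  # "$" is a sentinel that sorts before A, C, G, T
sa = build_suffix_array(text)
psi = build_psi(sa)
samples = sample_suffix_array(sa, step=4)  # hypothetical sampling rate
recovered = [lookup(r, psi, samples, len(sa)) for r in range(len(sa))]
hits = find_occurrences(text, sa, "ACG")
```

Each lookup takes up to `step - 1` extra hops through Ψ, which illustrates the small slowdown in access speed that the abstract accepts in exchange for a smaller index.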
Similar resources
A Resource-frugal Probabilistic Dictionary and Applications in (Meta)Genomics
Genomic and metagenomic fields, generating huge sets of short genomic sequences, brought their own share of high performance problems. To extract relevant pieces of information from the huge data sets generated by current sequencing techniques, one must rely on extremely scalable methods and solutions. Indexing billions of objects is a task considered too expensive while being a fundamental nee...
An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data
Finding similar substrings/substructures is a central task in analyzing huge amounts of string data such as genome sequences, web documents, log data, etc. In the sense of complexity theory, the existence of polynomial time algorithms for such problems is usually trivial since the number of substrings is bounded by the square of their lengths. However, straightforward algorithms do not work for...
Cloning and determination of biochemical properties of protective and broadly conserved vaccine antigens from the genome of extraintestinal pathogenic Escherichia coli into pET28a vector
Urinary tract infections are among the most common infectious diseases and cause significant health problems worldwide. The term refers to an infection in any part of the renal system. Uropathogenic Escherichia coli, Proteus mirabilis, and Klebsiella are the main organisms involved in these infections. After identifying same protective and conserved virulence ...
A Fast Divide-and-Conquer Algorithm for Indexing Human Genome Sequences
Since the release of the human genome sequences, indexing them has been one of the most important research issues, and the suffix tree is the most widely adopted structure for that purpose. The traditional suffix tree construction algorithms suffer severe performance degradation due to the memory bottleneck problem. The recent disk-based algorithms also have limited performance improvement due to r...
May 16th 2012 Combinatorial Pattern Matching
1) Title: Algorithms for genome assembly Authors: Leena Salmela, Veli Mäkinen, Niko Välimäki, Johannes Ylinen, Esko Ukkonen Presenter: Leena Salmela Description: Current DNA sequencing technologies can produce huge amounts of short reads. The de novo genome assembly problem is to infer the genome of an organism based on these reads. We have studied several problems related to the assembly probl...
Journal title:
- Genome informatics. International Conference on Genome Informatics
Volume: 12
Publication date: 2001